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Accurate reconstruction of the regulatory networks that control gene expression is one of the key current challenges in 
molecular biology. Although gene expression and chromatin state dynamics are ultimately encoded by constellations of 
binding sites recognized by regulators such as transcriptions factors [TFs) and microRNAs [miRNAs), our understanding 
of this regulatory code and its context-dependent read-out remains very limited. Given that there are thousands of 
potential regulators in mammals, it is not practical to use direct experimentation to identify which of these play a key role 
for a particular system of interest. We developed a methodology that models gene expression or chromatin modifications 
in terms of genome-wide predictions of regulatory sites and completely automated it into a web-based tool called 
ISMARA (Integrated System for Motif Activity Response Analysis). Given only gene expression or chromatin state data 
across a set of samples as input, ISMARA identifies the key TFs and miRNAs driving expression /chromatin changes and 
makes detailed predictions regarding their regulatory roles. These include predicted activities of the regulators across the 
samples, their genome-wide targets, enriched gene categories among the targets, and direct interactions between the 
regulators. Applying ISMARA to data sets from well-studied systems, we show that it consistently identifies known key 
regulators ab initio. We also present a number of novel predictions including regulatory interactions in innate immunity, 
a master regulator of mucociliary differentiation, TFs consistently disregulated in cancer, and TFs that mediate specific 
chromatin modifications. 



[Supplemental material is available for this article.] 

Since the seminal work of Jacob and Monod (1961), much has been 
learned about the molecular mechanisms by which gene expres- 
sion is regulated and the molecular components involved. His- 
torically; most work has focused on transcription factors (TFs); 
arguably the most important regulators of gene expression; which 
bind to cognate sites in DNA and regulate the rate of transcription 
initiation. However, more recently it has become clear that the 
state of the chromatin, which can be modulated through modifi- 
cations of the DNA nucleobases and of the histone tails of nucle- 
osomeS; also plays a crucial role. For example; the local chromatin 
state affects the ability of TFs to access their binding sites, and the 
chromatin state can in turn be modified through TF-guided re- 
cruitment of chromatin modifying enzymes. Furthermore, an en- 
tirely new layer of post-transcriptional regulation has been un- 
covered in recent years in the form of microRNAs (miRNAs) (Bartel 
2009). These guide RNA-induced silencing complexes to target 
mRNAS; inhibiting their translation and accelerating their decay 
(Fabian et al. 2010). 

In spite of these many insights, our current understanding of 
the function of genome-wide gene regulatory networks in mam- 
mals is still rudimentary. For example, we only know the sequence 
specificity of less than half (Matys et al. 2003; Wasserman and 
Sandelin 2004; Pachkov et al. 2007) of the -1500 (Vaquerizas et al. 
2009) TFs in mammalian genomes. Our knowledge of how TF 
binding is affected by chromatin state, of the combinatorial in- 
teractions between TFs and their cof actors, and the impact of post- 
translational modifications on TF activity; is even more fragmen- 
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tary. Our understanding of the transcriptome-wide effects of 
miRNAs on gene expression remains similarly limited. Given that 
we are clearly still far from being able to develop realistic quanti- 
tative models of genome-wide gene regulatory dynamics; the most 
constructive contribution that computational approaches can 
currently provide is to develop models that help guide experi- 
mental efforts. 

Due to the dramatic decrease in high-throughput measure- 
ment costs, it has become relatively straightforward to measure 
gene expression (i.e., with microarray or RNA-seq) or chromatin 
state (with ChlP-seq) genome-wide across a set of samples for a 
particular system of interest. Consequently; researchers interested 
in a particular developmental or cellular differentiation process, or 
in the response of a tissue to a particular perturbation, have in- 
creasingly turned to genome-wide profiling of expression and 
various chromatin marks, with the aim of using such data to elu- 
cidate the key regulatory circuitry acting in their system. However; 
deriving insights into regulatory circuitry from high-throughput 
data requires sophisticated computational analysis methods. 

In recent years, comparative genomic methods have been de- 
veloped that allow relatively accurate computational prediction of 
regulatory sites for hundreds of TFs and miRNAs on a genome-wide 
scale (van Nimwegen 2007; Friedman et al. 2009; Arnold et al. 
2012a). In addition, through extensive experimental effortS; ge- 
nome-wide annotations of transcript structures (The FANTOM 
Consortium et al. 2005; Djebali et al. 2012) and promoters (Balwierz 
et al. 2009) have become available. Capitalizing on these de- 
velopments; we recently presented a general method called Motif 
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Activity Response Analysis (MARA) for inferring key gene regula- 
tory circuitry from genome-wide gene expression data by model- 
ing the observed gene expression dynamics in terms of computa- 
tionally predicted regulatory sites. We showed that this method 
can reconstruct core transcription regulatory networks in a human 
differentiation system ab initio (The FANTOM Consortium et al. 
2009). Furthermore; several recent studies confirm that computa- 
tional modeling of observed expression and chromatin dynamics 
is a powerful approach to reconstructing regulatory circuitry 
(Novershtern et al. 2011; Yosef et al. 2013) (to give just two ex- 
amples) and show that MARA-like approaches can be extended to 
include miRNA regulation (Setty et al. 2012) and the dynamics of 
genome- wide histone modifications (Arnold et al. 2012b). 

Unfortunately, applying MARA-like methods to high- 
throughput data is technically challenging and requires the ex- 
pertise of dedicated computational biology groups. Thus, whereas 
many laboratories are now routinely producing high-throughput 
data sets, and methodologies for analyzing such data have been 
described in the literature, the vast majority of groups that produce 
data have to develop collaborations with expert computational 
groups to apply these methods. Indeed, in recent years MARA has 
been applied to a large range of mammalian systems studied by 
various experimental collaborators and experimentally validated 
predicted regulatory circuitry in these systems (Summers et al. 
2010; Aceto et al. 2012; Arner et al. 2012; Hasegawa et al. 2012; 
Perez-Schindler et al. 2012; Suzuki et al. 2012; Eisele et al. 2013; 
Meier- Abt et al. 2013; Tiwari et al. 2013a,b; Vervoort et al. 2013). 
Although these studies further validated the power of the method, 
they required a considerable investment of time and effort for the 
analysis of each new data set. Through these experiences we be- 
came convinced that lack of easy access to such computational 
analysis procedures is currently a major bottleneck in the field, and 
we decided to invest our efforts into developing a completely au- 
tomated system for performing MARA. 

Here we present ISMARA (Integrated System for Motif Activity 
Response Analysis), a completely automated computational tool 
that aims to make the computational reconstruction of regulatory 
circuitry from high-throughput data easily accessible to any re- 
searcher. Given as input a set of genome-wide gene expression or 
chromatin state measurements across a number of samples, 
ISMARA uses motif activity response analysis to identify the key 
regulators (i.e., TFs and miRNAs) driving gene expression/chro- 
matin state changes across the samples, the activity profiles of 
these regulators, their target genes, and the sites on the genome 
through which these regulators act. The analysis combines pre- 
calculated annotations of regulatory sites for hundreds of regula- 
tors across genes in mammalian genomes with automated pro- 
cessing of input data, modeling and parameter inference, and post- 
processing to provide a large collection of analysis results. To use 
ISMARA, users only need to upload their data to the web server 
http://ismara.unibas.ch/ and submit it to the system, without the 
need of setting or tuning any parameters. All results are presented 
through a user-friendly graphical web interface. In ISMARA, the 
motif activity response analysis has been extended to model not 
only gene expression data from various platforms (microarray, 
RNA-seq), but essentially any sequencing data reflecting a genomic 
mark (ChlP-seq), including chromatin modifications or TF bind- 
ing. In addition, ISMARA models not only the effects of TFs on 
mammalian gene expression but also the effects of miRNAs. 

Below, we first outline the methodologies that we developed 
for automating the computational modeling and provide an 
overview of all results that ISMARA provides by applying it to RNA- 



seq data of a human tissue atlas. After this, we further demonstrate 
ISMARA using a number of example data sets that highlight dif- 
ferent aspects of the method. 

Results 

As schematically depicted in Figure 1, ISMARA takes raw gene ex- 
pression (microarray or RNA-seq) or chromatin state (ChlP-seq) 
data from any number of samples and automatically models 
this data in terms of computationally predicted regulatory sites, 
thereby predicting the genome-wide regulatory interactions that 
drive the observed expression or chromatin state changes across 
the samples. ISMARA is available through a web interface http:// 
ismara.unibas.ch/ as part of our SwissRegulon resources (Pachkov 
et al. 2007). Users can directly upload unprocessed microarray 
(CEL files), RNA-seq, or ChlP-seq data (BED or BAM files), which 
are then analyzed automatically without the need for any addi- 
tional input from the user (Fig. IB). The results are made available 
through a web interface and can also be downloaded in flat-file 
format. 

In order to be able to provide such completely automated 
analysis, ISMARA makes use of precalculated genome-wide anno- 
tations of promoters, sets of transcripts associated with each pro- 
moter, multiple alignments of promoter regions across seven 
mammals, a curated collection of mammalian regulatory motifs, 
TFBS predictions for all motifs across all promoters, and predicted 
target transcripts of miRNAs (Fig. lA). Additionally, we developed 
a substantial number of analysis procedures in order to automati- 
cally process and normalize the raw input data (Fig. IB) and 
transform them into a standardized format to which the motif 
activity response analysis can be applied (Fig. IC). The analysis 
procedures involved in all these steps are outlined in Methods and 
detailed in the Supplemental Methods. 

Overview of the analyses performed by ISMARA 

To give an overview of the analysis results that ISMARA automat- 
ically provides for any data set, and to outline how these analyses 
are performed, we applied ISMARA to an example RNA-seq data set 
of expression profiles across 16 human cell types, i.e., data from the 
Illumina Body Map 2 (GEO accession GSE30611) (IBM2). The re- 
sults obtained after submitting the raw RNA-seq data to ISMARA 
are available at http://ismara.unibas.ch/supp/datasetl_IBM/ismara_ 
report/. 

As described in Methods, ISMARA infers the motif activities 
according to a linear model (Fig. ID) using a Bayesian procedure. 
Importantly, a Gaussian prior on motif activities is used to avoid 
overfitting, and the parameter of this prior is fit automatically by 
ISMARA for each input data set using a cross-validation scheme. 
Motif activities are fitted from 80% of the promoters and the per- 
formance of the model, i.e., the fraction of the variance in Eps 
explained by the model, is assessed on the remaining 20% of 
promoters. 

Although our model fits Eps, it is important to note that it is 
not the model's aim to provide an accurate fit of the signal Eps. As 
discussed in the introduction, we do not expect the highly sim- 
plified linear model to provide an accurate fit to the signal Eps at 
individual promoters. Indeed, the model explains 7.7% of the 
variance in Eps for the IBM2 data, and across the data sets studied 
here, we find that the model typically captures 5%-15% of the 
variance of Eps across samples (Supplemental Fig. 2). Although 
these fractions are modest, given that tens of thousands of pro- 
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Figure 1 . Outline of the Integrated System for Motif Activity Response Analysis. (A) ISMARA starts from a curated genome-wide collection of promoters 
and their associated transcripts. Using a comparative genomic Bayesian methodology (Arnold et al. 201 2a), transcription factor binding sites (TFBSs) for 
—200 regulatory motifs are predicted in proximal promoters. Similarly, miRNA target sites for —100 seed families are annotated in the 3' UTRs of 
transcripts associated with each promoter (Friedman et al. 2009). (B) Users provide measurements of gene expression (microarray, RNA-seq) or chromatin 
state (ChlP-seq). The raw data are processed automatically, and a signal is calculated for each promoter in each sample. For ChlP-seq data, the signal is 
calculated from the read density in a region around the transcription start. For gene expression data, the signal is calculated from read densities across the 
associated transcripts (RNA-seq) or intensities of associated probes (microarray). (C) The site predictions and measured signals are summarized in two 
large matrices. The components Np^ of matrix N contain the total number of sites for motif m (TF or miRNA) associated with promoter p. The components 
Eps of matrix E contain the signal associated with promoter p in sample s. (D) The linear MARA model is used to explain the signal levels Eps in terms of 
bindings sites Npm and unknown motif activities /A^s, which are inferred by the model. The constants Cp and Cs correspond to basal levels for each promoter 
and sample, respectively. (£) As output, ISMARA provides the inferred motif activity profiles A^s of all motifs across the samples, s, sorted by the significance 
of the motifs. A sorted list of all predicted target promoters is provided for each motif, together with the network of known interactions between these 
targets (provided by the String database, http://string-db.org/) and a list of Gene Ontology categories that are enriched among the predicted targets. 
Finally, for each motif, a local network of predicted direct regulatory interactions with other regulators is provided. 
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moters are involved; they are extremely significant, i.e., using 
randomization of the association between site-count and expres- 
sion, we estimate that the P- value for explaining 7.7% of the var- 
iance by chance is ~10~^^^ (Supplemental Methods; Supplemental 
Fig. 3). 

ISMARA's main aim is to identify which regulatory motifs, m, 
play an important role and how these motifs contribute to Eps 
across the samples. First, ISMARA's output lists all regulatory motifs 
sorted by a Z-score, which summarizes the importance of the motif 
for explaining the expression variation across the samples. This 
score roughly corresponds to the average number of standard de- 
viations the motif activity is away from zero (see Methods and 
Supplemental Methods). Besides the Z-score of each motif, the list 
also displays the set of TFs or miRNAs that bind to sites of the 
motif, a thumbnail of its activity across the input samples, and 
a sequence logo for each motif (Supplemental Fig. 4). Following the 
link from the motif name leads to a page with a large number 
of predictions regarding the motifs precise regulatory role. To il- 
lustrate these. Figure 2 shows some of ISMARA's results for the 
HNFIA, MYB, hsa-miR-124/hsa-miR-506, and the SREBF motifs. 

HNFIA was the most significant motif for the IBM2 data set, 
and its predicted activity is highly tissue-specific, being almost en- 
tirely restricted to liver and kidney (Fig. 2A; Supplemental Figs. 5, 6). 
The associated transcription factor hepatocyte nuclear factor 1 ho- 
meobox A (HNFIA) is relatively well-studied and indeed known to be 
mainly expressed in liver, kidney, stomach, and intestine (Kuo et al. 
1990; Serfas and Tyner 1993), where it is essential for organ function 
(Pontoglio et al. 1996). Figure 2A also illustrates that the inferred 
motif activities are highly reproducible. In fact, motif activities are 
more reproducible than the expression profiles from which the motif 
activities were inferred (Supplemental Fig. 16). The reason for this 
high reproducibility of motif activities is that each motif, m, typically 
targets hundreds to thousands of promoters, and the inferred motif 
activities, Ams, are statistical averages of the behaviors of a large 
number of promoters. This averaging causes the complexities at in- 
dividual promoters to effectively cancel out and ensures that the 
overall influence of a motif can still be reliably inferred. 

For many of the regulatory motifs, there are multiple TFs that 
can bind to the sites of the motif, and it is not a priori clear which of 
the TFs is most responsible for the motif activity in a given system. 
ISMARA infers motif activities from the behavior of the predicted 
targets of the motif. That is, roughly speaking, an increased activity 
is inferred when its targets show on average an increase in ex- 
pression that cannot be explained by the presence of sites for other 
motifs in their promoters. The mRNA expression profiles of the TFs 
associated with a motif thus provide independent information 
about the link between the TFs and the motif activities, and 
ISMARA provides an analysis of the correlation between motif 
activities and the expression profiles of the associated TFs. For 
HNFIA, there is a good correlation between mRNA expression of 
the TF and the inferred motif activity (Fig. 2A, inset). However, for 
the fourth most significant motif (POU2F), only one of the three 
POU2F factors, POU2F2 (also known as OCT2), shows significant 
correlation of its mRNA level with motif activity, and it is the most 
highly expressed. This suggests that POU2F2 is mainly responsible 
for the motif activity in these tissues (Supplemental Figs. 7, 8). The 
fact that the correlation is positive also strongly suggests that 
POU2F2 acts as an activator. In contrast, whenever a negative 
correlation between motif activity and TF expression is observed, 
the TF most likely acts as a repressor, e.g., as observed for the 
known repressor ZHX2 (Supplemental Fig. 9; Kawata et al. 2003). 
However, it should be noted that motif activity does not need to be 



a direct function of TF expression, i.e., the effect of a TF on its 
targets will not only depend on its expression but possibly on post- 
translational modifications, on cellular localization, and on the 
presence of specific cof actors. Therefore, although a strong corre- 
lation between TF expression and motif activity is a good in- 
dication that the TF is responsible for the motif activity, the ab- 
sence of such a correlation does not imply that the TF is not 
involved in the motif's activity. 

ISMARA predicts individual target promoters p for each motif m 
by calculating the difference Spm of the log-likelihood of the model 
with the original site-count matrix N and the log-likelihood of 
the model in which only the binding sites for motif m in promoter 
p have been removed (Methods and Supplemental Methods). For 
each motif, a searchable and resizable list is provided of all tar- 
get promoters, their associated transcripts, and associated genes 
(Supplemental Fig. 10). For HNFIA, the accuracy of ISMARA's tar- 
get predictions is suggested by the fact that most of the top pre- 
dicted targets are supported by the literature, including some of the 
oldest known direct targets of HNFIA (Courtois et al. 1988). For 
each target promoter, ISMARA provides a link to the genome 
browser view of the promoter (Supplemental Fig. 11), showing the 
precise genomic location of the predicted regulatory site. To pro- 
vide the user with a more intuitive picture of the predicted list of 
targets of the motif, a link is provided to a network view of the 
target genes as provided by the STRING database (Jensen et al. 
2009), where network links indicate known associations between 
the genes. For HNFIA, the STRING network reveals a large, highly 
connected cluster of predicted targets that are known to be involved 
in the metabolism of drugs and toxins in the liver (Supplemental 
Fig. 12). As another means to provide insights into the pathways 
targeted by a given motif, ISMARA also provides lists of enriched 
Gene Ontology categories (Fig. 2; Supplemental Fig. 13; Ashburner 
et al. 2000), which in this case confirms that HNFIA targets genes 
involved in the metabolism of drugs and xenobiotics. 

To gain insight in the transcription regulatory networks that 
control expression profiles, it is of particular interest to identify 
direct regulatory connections between the TFs themselves. In 
ISMARA, a direct regulatory interaction from motif m to m' is 
predicted when motif m is predicted to target a promoter of one of 
the TFs associated with m'. To visualize the predicted direct regu- 
latory interactions between regulators, ISMARA provides, for each 
motif m, a local network picture that shows all predicted regulatory 
connections between m and promoters of TFs that are associ- 
ated with other motifs (Supplemental Fig. 14). The user can inter- 
actively change the cutoff on the target score Spm to draw this 
picture. For HNFIA, we find that the strongest predicted targets are 
HNF4A, F0XA2, NR5A2, dnd HNFIA itself (Supplemental Fig. 14). In 
addition, HNF4A and FOXA2 are predicted to target the HNFIA 
promoter as well. Remarkably, all these predictions are supported by 
independent experimental evidence (Piaggio et al. 1994; Boj et al. 
2001; Bartoov-Shifman et al. 2002; Tomaru et al. 2009; Bochkis et al. 
2012; Molero et al. 2012). 

ISMARA predicts that the MYB motif is by far most active in 
testis, and it targets genes that are involved in meiosis and sper- 
matogenesis (Fig. 2B). In addition, the MYB motif is predicted to 
target the RFX4, RFX2, and NR5A1 promoters. A literature search 
reveals that MYBLl, a close homolog of MYB that binds to the 
same regulatory sites, is a master regulator of male meiosis and 
spermatogenesis (Toscani et al. 1997; Bolcun-Filas et al. 2011). 
Moreover, RFX2 has been implicated as a direct target of MYBLl in 
spermatogenesis (Horvath et al. 2009). ISMARA's prediction that 
RFX4 is also regulated by the MYB motif (presumably through 
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Figure 2. Results for the lllumina Body Map 2. Each panel corresponds to a motif (indicated with name and sequence logo) and shows the inferred 
motif activities across the 16 tissues (activities with error bars in panels A and C, and activity Z-values in panels B and D). Tables show Gene Ontology 
categories enriched among predicted targets of each motif, and individual target promoters (D). The networks (B,Q show direct regulatory interactions 
between the motif and other regulators. (A) Red and black curves correspond to motif activities from two replicate measurements. The inset shows the 
correlation between motif activity and HNF1 AmRNA levels. (S)The /nsef shows that MYB is predicted to directly target the RFX4 promoter with target score 
8.1 34. (C) The regulatory network inset and GO table show that hsa-miR-1 24/hsa-miR-506 is predicted to directly target many TFs. (D) The red bars show 
Z-values of the average motif activity of the SREBF motif for samples coming from older (age 58-86) and younger (age 1 9-47) donors. 
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MYBLl) is novel to our knowledge. Finally; ISMARA's prediction 
that the MYB promoter is targeted by the E2F motif is also sup- 
ported by the literature (Campanero et al. 1999). 

To illustrate ISMARA's predictions of the regulatory role of 
miRNAs, Figure 2C shows results for the second most significant 
miRNA seed family, hsa-miR-124/hsa-miR-506. This seed family 
has the strongest negative activity in brain, and its targets are 
highly enriched for TFs (Fig. 2C). Indeed, hsa-miR-124 is a well- 
known brain-specific miRNA (Landgraf et al. 2007). Moreover, 
of the top nine predicted TF target genes of hsa-miR-124, six 
{TEADl, CEBPA, AR, SPl, SNAI2, NFATCl) are supported by 
independent experimental evidence (Lim et al. 2005; Baek et al. 
2008; Ponomarev et al. 2011; Liang et al. 2013; Shi et al. 2013), 
again confirming the high accuracy of ISMARA's target predictions. 

Of course, most of the results highlighted in Figure 2, such as 
the function of HNFIA in liver and the brain-specific role of hsa- 
miR-124, are well-known from the literature. However, all these 
results, including very specific predictions of the precise targets 
of each regulator, were obtained by a completely automated 
analysis of RNA-seq data from 16 human tissues, without any 
free parameters or specific processing of the data. Moreover, they 
constitute only a small selection of the predictions made by 
ISMARA. 

By default, ISMARA focuses on regulatory motifs that explain 
changes in expression levels across the input samples. However, 
some users may be interested in regulators that are predictive for 
a consistently high or consistently low expression level across all 
samples. To address this, ISMARA also fits the absolute expression 
levels of the promoters, i.e., averaged over all input samples, in 
terms of 'mean activities' (Methods; Supplemental Methods). For 
the IBM2 data set, we find that the TFs YYl and NRFl are most 
predictive of high average expression, whereas the known re- 
pressors REST and RREBl are most predictive for low average ex- 
pression (Supplemental Fig. 15). 

Experiments are often performed in multiple replicates, and 
ISMARA implements procedures for specifically identifying motifs 
that behave reproducibly across the replicates. The ISMARA results 
page links to a section where users can provide batch and replicate 
annotation for their samples, which is then used by ISMARA to 
calculate motif activity profiles that are averaged over replicates 
using a rigorous Bayesian procedure (Supplemental Methods). In 
addition, updated motif Z-scores quantify to what extent a motif's 
activity varies across samples in a way that is reproducible across 
the replicates (Supplemental Methods). For example, the replicate- 
averaged results for the IBM2 data set are available at http://ismara. 
unibas. ch/ supp/ dataset 1_IBM/ averaged_replicates/ averaged_report/. 

Apart from replicate averaging, this procedure can further be 
used to calculate contrasts between subsets of samples. To illustrate 
this, we noted that the samples of the 16 tissues of the IBM2 data 
set derived from donors of different ages, and we investigated 
whether any motifs have consistently different activities between 
samples from older and younger individuals. We divided the 
samples into those deriving from donors aged 19-47 and those 
deriving from donors aged 58-86. We then directed ISMARA to 
calculate averaged activities for 'young' and 'old' samples for 
each motif (results at http://ismara.unibas.ch/supp/datasetl_IBM/ 
averaged_age/averaged_report/). We found that only the SREBF 
motif is significantly differently regulated between old and young 
samples (Fig. 2D). The targets of SREBF are up-regulated in older 
tissues relative to the younger ones and are highly enriched for 
lysosomal genes. Lysosomes are responsible for the degradation of 
many macromolecules, including proteins, and increase in lyso- 



somal mass is a well-known characteristic of aging and senescence 
in cells (Cuervo and Dice 2000; Kurz et al. 2000). In addition, ev- 
idence is increasing that a progressive decrease in the efficiency 
of autophagy and lysosomes with age plays a key role in aging- 
associated degenerative changes in mammals (Rubinsztein et al. 
2011). Several recent findings support that SREBP TFs play a key 
role in these processes. SREBF 1 expression increases with age in rat 
brains (Okamoto et al. 2006), SREBF 1 -mediated lipogenesis is in- 
volved in senescence (Kim et al. 2010), SREBF2 regulates autophagy 
(Seo et al. 201 1), and SREBF activity is regulated by mTOR complex 
1 (Peterson et al. 2011). It is remarkable that simply by contrasting 
motif activities in tissues from younger and older donors, ISMARA 
was able to automatically identify SREBF as a key regulator of aging- 
related changes in expression of lysosomal genes. 

As another example of the power of motif activity contrasts 
across sets of samples, we searched for motifs consistently dis- 
regulated in cancer by joint analysis of the human GNF atlas of 79 
tissues and cell lines (Su et al. 2004) and the NCI-60 reference 
cancer cell lines (Ross et al. 2000) (for full results, see http://ismara. 
unibas. ch/supp/dataset2/ismara_report/). Supplemental Tables 2 
and 3 show the motifs that are most consistently up-regulated or 
down-regulated in cancers, including miRNAs. As discussed in the 
Supplemental Material, many of the top disregulated motifs, such 
as HIFl A and hsa-miR-205 miRNA (Supplemental Fig. 1 7), are well- 
known in cancer biology, again supporting the accuracy of ISMARA's 
predictions. Besides well-known oncogenes and tumor suppres- 
sors, ISMARA also makes several novel predictions of regulators 
consistently disregulated in cancers, including the TFs HANDl, 
KLF12, BPTF, FOXD3, and ZNF143. 

Inferring motif activity dynamics: inflammatory response 

To illustrate ISMARA's analysis of time series data, we applied it to 
a time series of expression data obtained after activation of human 
umbilical vein endothelial cells (HUVECs) with tumor necrosis 
factor (TNF, also known as TNF-alpha). Messenger RNA expression 
was measured every 15 min for the first 4 h after treatment, and 
every 30 min for the next 4 h (Wada et al. 2009). Whereas the 
original study focused solely on nascent transcription, we here show 
that standard application of ISMARA to this data set (http://ismara. 
unibas. ch/supp/dataset3/ismara_report/) uncovers the transcrip- 
tion regulatory network involved in this inflammatory response in 
remarkable detail. 

The response of endothelial cells to TNF is known to be me- 
diated by TFs of the NFkappaB family, GATA2, IRFl, and JUN 
(Inoue et al. 2006) TFs. TFs of the NFkappaB family in particular are 
crucial for the resulting inflammatory response (Kempe et al. 
2005). Indeed, ISMARA infers that the two most significant motifs 
are IRF1,2,7 and NFKBl/REL/RELA. The activity of NFKBl/REL/ 
RELA increases sharply in the first 45 min and slower afterward, 
until it reaches a steady activity after 3 h. The activity of the 
IRFl, 2, 7 motif increases steadily starting at 30 to 45 min after 
treatment until the end of the time course (Fig. 3 A). As shown by 
NFKBl/REL/RELA's local network figure (Fig. 3B and on the 
ISMARA results website), ISMARA predicts that IRFl is activated 
directly at the level of transcription by these regulators, which is 
confirmed by the experimental literature (Harada et al. 1994). 
Other predicted targets of NFKBl/REL/RELA that are also signifi- 
cantly up-regulated in this process are TNF receptor genes, com- 
ponents of the JAK-STAT pathway (note that STAT2,4,6 is the 1 1th 
most significant motif, indicating that STAT activity changes, af- 
fecting the level of its targets) and MHC class I genes. The latter are 
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also predicted to be regulated by IRF1,2,7, which is confirmed by 
experimental data (Ten et al. 1993). ISMARA makes the novel 
predictions that both NFKBl/REL/RELA and IRF1,2,7 activate the 
fifth most significant motif, PRDMl, which is an important 
developmental regulator in the B-cell and T-cell lineages and is 
required for the secretory pathway in B-cells (Martins and Calame 
2008). PRDMl activity increases, like that of IRF, across the entire 
time course, and these two regulators appear to share many of their 
predicted targets, including type 1 interferon pathway genes, the 
immunoproteasome (Seifert et al. 2010), ubiquitin conjugating 
enzymes, antigen peptide transporters, and MHC class I genes. 
These targets suggest that the IRF and PRDMl TFs may be re- 
sponsible for activation of the antigen presenting pathway. 

Although our TFBS predictions incorporate cross-species 
conservation analysis, this does not mean that the predicted tar- 
gets must be conserved across mammals. For example, the third 
most significant TF target of the IRF motif is the ATF5 promoter, 
which is targeted through a TFBS that is primate-specific (Supple- 
mental Fig. 18). 

To provide an example assessment of the accuracy of ISMARA's 
genome-wide target predictions, we compared the predicted tar- 
gets of NFKBl/REL/RELA with targets identified through ChlP-seq 
in lymphoblastoid cell lines derived from 10 individuals of African, 
European, and Asian ancestry (Kasowski et al. 2010). We find that 
almost two-thirds of the top 50 targets, >50% of the top 150 tar- 
gets, and —40% of the top 300 targets are supported by ChlP-seq 
binding at the promoter (Supplemental Fig. 19). To put these 
numbers in perspective, we compared the validation of ISMARA's 
targets with the variability in NFKBl/REL/RELA binding across 
individuals and replicate samples. We used the ChlP-seq data from 
each sample to predict target promoters, and then Validated' these 
'predictions' using the other ChlP-seq data sets in complete anal- 



ogy to the way we validated ISMARA's targets. The typical valida- 
tion rate for the ChlP-seq data was higher than for the ISMARA 
target predictions, i.e., 60%-70% versus 40%-66%. This is not 
surprising given that all ChlP-seq data were obtained in the same 
lymphoblastoid cell type, which differs from the HUVEC cells. 
Still, we found significant variability across the ChlP-seq data sets, 
and the targets from some ChlP-seq data sets had lower in- 
tersection with the other ChlP-seq data sets than ISMARA's targets 
(Supplemental Fig. 19). This analysis shows that ISMARA's ge- 
nome-wide predictions can reach accuracies comparable to those 
obtained from a ChlP-seq study. 

Finally, the third most significant motif is XBPl, which is 
activated only after 2.5 h. Its predicted targets are highly over- 
represented for endoplasmic reticulum (ER) genes and genes in- 
volved in vesicle-mediated and Golgi transport, consistent with 
the fact that XBPl is a major regulator of ER stress and the unfolded 
protein response (UPR) (Glimcher 2010). Moreover, several studies 
support that the UPR is a general characteristic resulting from in- 
flammation or TNF activation in endothelial cells (Gargalovic et al. 
2006; Civelek et al. 2009). Interestingly, the induction of XBPl's 
activity occurs at the same time that the NFKBl/REL/RELA activity 
stops increasing, which is in line with studies showing that the 
UPR can attenuate the induction of inflammation as mediated by 
TFs of the NFkappaB family (Kaser et al. 2008; Kitamura 2011; Li 
et al. 201 1). The induction of XBPl's activity is not reflected in the 
expression of XPBl itself, which is almost constant across the time 
course (Supplemental Fig. 20). This underscores that ISMARA in- 
fers a motifs activity from the expression of its predicted targets 
and does not use the regulator's own expression. Indeed, it has 
been established that XBPl activity is regulated post-transcrip- 
tionally through alternative splicing (Yoshida et al. 2001; Gallon 
et al. 2002). Together, these results demonstrate that ISMARA re- 




Figure 3. Analysis of an inflammatory response time series of human umbilical vein endothelial cells responding to TNF. (A) Time-dependent activities 
of the three most significant motifs, i.e., NFKB1 /REL/RELA (red), IRF1 /2/7 (black), and XBPl (blue). Error bars denote standard deviations of the inferred 
activities. (B) Summary of the inferred core regulatory network. Selected top motifs are shown together with interactions between them and pathways/ 
functional categories that are enriched among the targets of these motifs. The intensity of the color corresponds to the Z-score of the motif, its time- 
dependent activity is indicated inside the node, and the thickness of each edge corresponds to its target score Spm. 
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constructs the core regulatory circuitry of the innate immune re- 
sponse in HUVEC cells (Fig. 3B) ab initio using only time course 
expression data. 

Identifying novel master regulators: mucociliary 
differentiation of bronchial epithelial cells 

Next, we turned to an example system for which much less is 
known, namely the mucociliary differentiation of bronchial epi- 
thelial cells on an air-liquid interface. Aiming to elucidate the 
regulation of bronchial development, Ross et al. (2007) performed 
differentiation experiments in triplicate over a period of 28 days 
with cells from three separate donors. This data was then analyzed 
with commonly used bioinformatic procedures, i.e., genes were 
clustered into coexpression clusters, and the clusters were analyzed 
for over-represented gene ontology categories and pathways. This 
analysis uncovered clusters associated with TGF-beta pathway 
genes, extra-cellular adhesion genes, and genes associated with the 
microtubule cytoskeleton, but no key regulators or regulatory in- 
teractions that drive these expression changes were identified. 

In contrast, applying ISMARA to this gene expression data set, 
we obtain the prediction that by far the most important regulatory 
motif in this system is RFX, whose activity is strongly increasing 
over the period from roughly day 4 to day 10 in all three donors 
(Fig. 4A) (http://ismara.unibas.ch/supp/dataset4/ismara_report/). 
The predicted targets of RFX are highly enriched in genes known to 
be associated with cilium assembly, axoneme, and the microtubule 



cytoskeleton genes (Fig. 4B), suggesting that RFX directs cilio- 
genesis in bronchial epithelial cells. 

The RFX family of TFs contains seven members, and it is not 
a priori clear which of these are driving the bronchial differentia- 
tion. Comparison of the mRNA expression profiles with activity 
profiles shows that two of the family members, RFX2 and RFX3, 
exhibit a striking correlation in their expression with the motif 
activity (Fig. 4A,C). Together these results strongly suggest that the 
TFs RFX2/3 are master regulators of ciliogenesis in this system. This 
prediction is consistent with previous studies that have shown that 
Rfx3 is necessary for the ciliogenesis of nodal cilia in mouse em- 
bryonic development (Bonnafe et al. 2004) and during ciliogenesis 
of motile cilia in a mouse cell-culture system (El Zein et al. 2009). 

Strikingly, ISMARA's results on the IBM2 data set also identi- 
fied the RFX motif as the key regulator of ciliogenesis in sper- 
matogenesis. As discussed above, in that system ISMARA predicted 
that the RFX2 and RFX4 promoters were directly targeted by the 
MYB motif (most likely through the MYBLl TF). Here we find that 
ISMARA predicts MYB to target the RFX2 promoter in the muco- 
ciliary differentiation system as well (Fig. 4B). In addition, ISMARA's 
prediction that RFX directly up-regulates FOXfl in this system was 
also made in the results on the IBM2 data set. Indeed, Rfx3 was 
found to activate Foxjl during ciliogenesis in the mouse cell- 
culture system mentioned above (El Zein et al. 2009). These ob- 
servations suggest that the core regulatory network involved in 
ciliogenesis, with MYBLl targeting RFX promoters and RFX TFs 
targeting FOX/i, is conserved across multiple mammalian systems. 




days after cells were cultured ^^t'^'^V ^"^^'^^ 

Figure 4. Mucociliary differentiation. (A) Inferred RFX motif activity profile in mucociliary differentiation of bronchial epithelial cells from three independent 
donors (black, red, and blue lines). (B) Key predicted regulators and their targets in this system. Selected top motifs are shown together with predicted 
interactions between them and pathways/functional categories that are enriched among predicted targets of these motifs. The intensity of the color cor- 
responds to the Z-score of the motif, its time-dependent activity for each donor is indicated inside the node, and the thickness of the edges corresponds to the 
target score Spm. (C) mRNA expression profiles of the RFX2 (solid) and RFX3 (dashed) genes across the differentiation (colors of the donors as in A). 
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As indicated in Figure 4B; ISMARA additionally predicts that, 
in this system, IRF1,2,7 up-regulates innate immune response 
genes, and a short spike of E2F activity up-regulates cell-cycle genes 
at day 1. Finally, there is a group of motifs (TBP, FOS_FOS{B,Ll}_ 
JUN{B,D}, RXR{A,B,G}, HOX{A6,A7,B6,B7}, and GLI1..3) whose 
targets are progressively down-regulated across the differentiation 
time course. The targets of these motifs are generally enriched for 
extracellular proteins involved in cell adhesion, cell-cell junctions, 
and signaling. More specifically, targets of GLI1..3 involve genes 
from the TGF-beta pathway, targets of TBP involve nucleosomal 
and intermediate filament cytoskeletal genes, and targets of the 
homeodomain motif (HOX{A6,A7,B6,B7}) are enriched for de- 
velopmental genes and transcription factors. The genes in these 
pathways are most likely involved in the transition of the tissue 
from squamous to columnar epithelial that occurs during differ- 
entiation. Thus, in contrast to the methods used in the original 
study (Ross et al. 2007), ISMARA predicts which regulators are 
directing various aspects of the differentiation process, including 
ciliogenesis, the innate immune response, and the transition from 
squamous to stratified epithelial. As far as we are aware, these 
predictions of the core regulatory network controlling mucociliary 
differentiation are all novel. 

interactions between TPs and miRNAs: epithelial-mesenchyme 
transition 

To illustrate ISMARA's ability to integrate the role of both TFs and 
miRNAs in the gene regulatory network, we took advantage of data 
from a system in which miRNAs are known to play important 
regulatory roles: the epithelial-to-mesenchymal transition (EMT). 
We applied ISMARA to expression measurements from epithelial 
and mesenchymal subpopulations (Scheel et al. 2011) (for results, 
see http://ismara.unibas.ch/supp/dataset5/ismara_report/) and used 
replicate averaging to identify regulators that explain the differ- 
ences between epithelial and mesenchymal cells (for results, see 
http:// ismara.unibas . ch/ supp/ dataset5/averaged_report/) . As dis- 
cussed in the Supplemental Material and Supplemental Figure 21, 
ISMARA automatically inferred much of the key regulatory in- 
teractions between TFs and miRNAs involved in EMT (for a review, 
see Polyak and Weinberg 2009) using only the gene expression data. 

IF activities affecting cliromatin state: analysis of CliiP-seq data 

Beyond analyzing gene expression data, motif activity response 
analysis can be applied to modeling any signal along the genome 
in terms of the local occurrence of TFBSs. Indeed, in recent work 
(Arnold et al. 2012b) we applied the MARA approach to ChlP-seq 
data mapping the dynamics of trimethylation at lysine 27 of his- 
tone 3 (H3K27me3) and identified TFs involved in recruiting this 
epigenetic mark that is set by the Polycomb system. In ISMARA, 
the analysis of ChlP-seq data has now been completely automated. 
In particular, given a ChlP-seq data set, ISMARA quantifies the 
signal at all promoters across all samples and models this in terms 
of the TFBSs at each promoter. For the details of ISMARA's pro- 
cessing and normalization of the ChlP-seq data, see Methods and 
Supplemental Methods. Similarly to the transcriptomic data, 
ISMARA thus by default focuses on the variation in ChlP-seq sig- 
nals at promoters only. However, the approach can easily be applied 
genome-wide; and to allow expert users to apply MARA to any 
collection of genomic regions, the ISMARA website includes an 
'expert mode' that allows users to upload their own signal and site- 
count matrices and apply MARA with these matrices. 



To illustrate ISMARA's results on ChlP-seq data, we make use 
of data from the ENCODE Project in which, besides gene expres- 
sion, nine different chromatin marks were measured across eight 
different cell types (all modifications and cell types are listed in 
Supplemental Tables 4 and 5) (Ernst et al. 2011). We first ran 
ISMARA separately on each of the 10 data sets, i.e., expression and 
nine chromatin modifications (see Supplemental Table 6 for the 
URLs of the results on all data sets). We observed that motifs that 
are highly significant for explaining differences in levels of a par- 
ticular chromatin mark across tissues were often also highly sig- 
nificant for explaining mRNA expression differences. This was 
particularly the case for methylation of lysine 4 on histone H3 
(H3K4me2, H3K4me3), for acetylation of histone H3 (H3K9ac, 
H3K27ac), and for trimethylation of lysine 36 on histone H3 
(H3K36me3). For example, Figure 5A shows the activity profiles for 
these marks for the SNAIL. 3 motif, which is recognized by the 
snail TFs (see Supplemental Fig. 22 for additional examples). As is 
clear from these figures, for these motifs the activity profile for 
expression is highly similar to those of all of these histone marks. 
Indeed, this reflects that these chromatin marks are associated with 
promoter activity (Wang et al. 2008), and several recent studies 
have shown that the levels of these marks can be used to predict 
gene expression levels (Karlic et al. 2010; Dong et al. 2012; Tippmann 
et al. 2012). 

To investigate the correlations between the levels of the dif- 
ferent chromatin marks more quantitatively, we performed prin- 
cipal component analysis (PGA) of the levels of the 10 different 
marks across all promoters, separately for each sample (Supple- 
mental Methods). Strikingly, we find that in each sample the first 
principal component explains the majority of the variance across 
promoters, typically explaining —60% of the total variance (Sup- 
plemental Fig. 23). Moreover, we find that the first principal com- 
ponent looks virtually identical for each sample (Supplemental 
Fig. 23), and Figure 5B shows the first principal component obtained 
using PGA on the pooled data from all cell types. The first principal 
vector has its highest positive component along the expression 
axis; and the activation-associated marks H3K4me3, H3K4me2, 
H3K9ac, H3K27ac, and H3K36me3 also all have a strong positive 
component in this vector, whereas the known repressive mark 
H3K27me3 has a negative component. These findings strongly 
suggest that variation along the first principal vector corresponds 
roughly to variation in 'promoter activity.' In addition, the fact 
that this first principal vector is identical in all tissues suggests that 
the relative levels of the different marks in this first principal vector 
result not from tissue-specific but from general factors, e.g., con- 
ceivably they may result from the general transcription machinery 
recruiting chromatin modifying enzymes. 

Because the variation in promoter activity captures almost 
two-thirds of the variation in all 10 measured levels at the pro- 
moter, any motif explaining variation in expression will also ap- 
pear to explain variation in all chromatin marks associated with 
promoter activity and confounds identification of TFs that are 
involved in affecting specific marks. To address this, for each motif 
we discarded the part of its activity profile along the first PGA 
component, retaining only variation in motif activities orthogonal 
to promoter activity. As illustrated in Figure 5G and Supplemental 
Figure 22, after removal of the first principal component, there are 
no longer any obvious correlations in the remaining motif activity 
profiles for different activating marks. 

We next analyzed the remaining motif activities and calcu- 
lated, for each motif and each mark, a Z-value quantifying the 
motifs contribution to explaining the mark's levels and also a 
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Figure 5. ISMARA predicts TFs involved in recruiting specific chromatin marks. (A) Activity across cell types of the SNAIl ..3 motif in explaining expression 
(black), and levels of the chromatin marks H3K4me3 (dark green), H3K4me2 (light green), H3K9ac (dark blue), H3K27ac (light blue), and H3K36me3 
(brown). (6) First principal component explaining the majority of variation in chromatin mark levels across all cell types. The bars indicate the relative 
contributions to the principal component of each mark. (C) Motif activities of the SNAIl ..3 motif, as in 4, but after removal of the first principal component. (D) 
Z-values and specificities (see text) of motifs for explaining H3K27me3 levels. The REST motif, with both highest Z-value and highest specificity, is indicated in 
red. (£) As in D, for H3K9ac levels. The two most significant motifs are shown in red. (F) As in D and f, for H3K27ac levels. (C) Activity, after removal of the first 
principal component, of the RFX motif for explaining H3K9ac (dark blue) and H3K27ac (light blue) levels. (H) As in G, for the ATF5_CREB motif. 
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'specificity' that measures the fraction of a motifs overall signifi- 
cance that is associated with a given mark (Supplemental 
Methods). Strikingly, we find that for many of the marks, the 
motifs that most significantly affect the mark are also among the 
most specific for that mark. For example, REST is the motif with 
the highest Z-value for H3K27me3 levels and is also by far the 
most specific for H3K27me3 (Fig. 5D). Indeed, in recent work 
(Arnold et al. 2012b), we showed that REST is involved in recruit- 
ing this mark during the differentiation of murine embryonic stem 
cells into pyramidal neurons, specifically at the neural progenitor 
state. With respect to the two acetylation marks, i.e., H3K9ac and 
H3K27ac, we find that the same two motifs, i.e., RFX and ATF/ 
CREB, are most significant for both these marks (Fig. 5E,F). It is well 
known that ATF/CREB TFs can recruit histone acetylases (HATs), 
such as CREB binding protein (CREBBP) and EP300 (Yuan and 
Gambee 2001); and for RFX TFs it has also been established that 
they can recruit HATs at particular promoters (Masternak et al. 
2003). Our results thus suggest that recruitment of HATs by TFs 
bound to ATF/CREB and RFX motifs make an important contri- 
bution to genome-wide histone acetylation at promoters. More- 
over, the activity profiles of these motifs for H3K9ac and H3K27ac 
are highly similar, suggesting that these two marks may be re- 
cruited through a common or highly overlapping pathways. Sup- 
plemental Figure 24 shows the most significant motifs for each 
of the other marks. Among the additional predictions made by 
ISMARA is that the PITX motif is associated with both mono- and 
dimethylation of lysine 4 of histone 3. This prediction is supported 
by recent biochemical evidence that PITX2 can recruit methyl- 
transferases that methylate H3K4 (Can et al. 2011). As expected, 
CTCF is the most significant motif explaining CTCF binding. 
ISMARA also makes several predictions that are completely novel, 
as far as we have been able to determine: It predicts that the he- 
patocyte nuclear factors HNFIA and HNF4A have the most sig- 
nificant effect on the levels of the H3K36me3 mark, which is 
known to be set by elongating RNA polymerase (Kizer et al. 2005; 
Yuan et al. 2009), and the YYl and the NF-Y complex (consisting of 
NFYA; NFYB; and NFYC) most significantly explain variations in 
H4K20mel levels. 

Discussion 

The advent of high-throughput technologies now allows the rou- 
tine measurement of genome-wide mRNA expression across con- 
ditions, and such data in principle provide the opportunity to 
systematically investigate gene regulation on a genome-wide scale 
across different model systems. However, a major bottleneck in the 
field is that such investigations require sophisticated computa- 
tional approaches that are not available to most experimental re- 
searchers. Here we have presented ISMARA, a completely auto- 
mated system that enables any researcher to apply sophisticated 
computational modeling, on data from their system of interest, 
and obtain concrete predictions on the key regulators acting in 
their system, their activities, their genome-wide targets, and so on. 

That the computational model at the core of ISMARA, i.e., 
motif activity response analysis, is a powerful method for recon- 
structing regulatory interactions from high-throughput data has 
already been demonstrated, not only in its original application 
(The FANTOM Consortium et al. 2009), but in a substantial 
number of recent studies across a wide range of mammalian sys- 
tems (Summers et al. 2010; Aceto et al. 2012; Arner et al. 2012; 
Arnold et al. 2012b; The FANTOM Consortium et al. 2012; Hasegawa 
et al. 2012; Perez-Schindler et al. 2012; Eisele et al. 2013; Meier-Abt 



et al. 2013; Tiwari et al. 2013a,b; Vervoort et al. 2013). In each of 
these studies, MARA successfully inferred key regulators and their 
regulatory interactions ab initio. The applications in this work 
not only further confirm that in systems where key regulatory 
interactions are already known, ISMARA successfully infers them, 
but it also provides a large collection of novel regulatory pre- 
dictions across different systems in human and mouse, e.g., novel 
regulators that are disregulated in cancers, novel regulatory in- 
teractions in the inflammatory response, and the core regulatory 
circuitry involved in mucociliary differentiation and ciliogenesis. 
We believe that by empowering experimental researchers to auto- 
matically apply this approach to their own data, ISMARA can make 
a substantial contribution to the study of gene regulatory networks. 

The applications we presented highlighted several of ISMARA's 
advantages. First, by inferring a regulator's activity from the be- 
havior of its targets, ISMARA does not rely on changes in a TF's 
expression to infer activity changes and readily detects activity 
changes due to alternative splicing, post-translation modifica- 
tions, changes in cellular localization, etc. Second, when motif 
activity is transcriptionally regulated, comparing motif activity with 
TF expression allows ISMARA to identify the relevant TF(s), i.e., 
as illustrated by the identification of RFX2 and RFX3 as the key 
regulators of mucociliary differentiation. Such comparisons can 
also indicate whether a regulator acts as a repressor or an activator. 
An important goal of ISMARA is to provide predictions that are 
amenable to direct experimental follow-up. In this respect, the 
GO enrichment and STRING network analysis are typically very 
helpful in identifying the biological processes and pathways tar- 
geted by each motif, often suggesting potential markers for ex- 
perimentally validating their predicted regulatory roles. Similarly, 
ISMARA's predictions of direct regulatory interactions between the 
key regulatory motifs provide concrete hypotheses regarding the 
regulatory circuitry that is acting in a given system, e.g., the pre- 
dicted regulatory feedbacks between NFKBl/REL/RELA, IRF TFs, 
and PRDMl, or the prediction that MYBLl is an upstream activator 
of RFX TFs in ciliogenesis. Moreover, the links to a genome browser 
view of the individual binding sites on the genome (Pachkov et al. 
2007) allow for targeted validation of such individual regulatory 
interactions. There are many indications that the actions of miRNAs 
and TFs are tightly integrated (Cui et al. 2006; Hornstein and 
Shomron 2006; Zhou et al. 2007), and ISMARA's incorporation of 
miRNA regulation allows for the automated identification of regu- 
latory interactions between TFs and miRNAs as demonstrated by the 
analysis of the EMT data. Finally, gene expression regulation involves 
a tight interplay between the actions of TFs and changes in the 
chromatin state. ISMARA's ability to not only model expression data, 
but any ChlP-seq signal at promoters genome-wide, allows for 
the identification of key TFs that are involved in dynamic regulation 
of the chromatin state, as exemplified here by the analysis of ChlP- 
seq data from the ENCODE Project, which predicted, among other 
things, regulatory factors involved in recruiting histone acetylations. 

There are of course several limitations to ISMARA's current 
approach which we aim to address in future work. First, using 
a simple linear model (Gao et al. 2004) has the advantage of being 
exactly solvable, but it ignores saturation effects that undoubtedly 
occur in reality. Second, the approach currently assumes that a 
given TF acts either mainly as an activator or mainly as a repressor, 
whereas it is clear that some TFs can act as an activator on some 
targets and as a repressor on others. Indeed, it has been recently 
shown (Bauer et al. 2010) that allowing such dual function of TFs 
can significantly increase correlation between model predictions 
and measurement. Explicitly considering higher-order constella- 
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tions of TFBSS; e.g., the occurrence of pairs or triplets of TFBSs for 
particular combinations of TFs, is another extension that we are 
currently evaluating. The regulatory motifs currently included 
in ISMARA represent —350 of the roughly 1500 mammalian TFs. 
However, through developments in protein array technology 
(Bulyk 2006) and the decreasing cost in ChlP-seq experiments, 
regulatory motifs for a rapidly increasing number of additional 
mammalian TFs have recently become available. We are currently 
working on curating a new, highly extended set of regulatory 
motifs, which we expect to incorporate into ISMARA in the near 
future. 

Finally, ISMARA currently focuses solely on predicted TFBSs 
in proximal promoters, ignoring the effects of distal enhancers. In 
contrast to promoters, accurate genome-wide maps of enhancers 
have not been available until recently. However, the discovery 
that active enhancers exhibit characteristic chromatin modifica- 
tion patterns (Heintzman et al. 2009), DNA methylation patterns 
(Stadler et al. 201 1), and more generally DNA accessibility patterns 
(Boyle et al. 2008), has now led to the first genome-wide mappings 
of enhancers in specific cell types (Shen et al. 2012). If a set of 
relevant enhancers for a particular system of interest is available, it 
is in principle straightforward to predict TFBSs in these enhancers, 
and we are currently developing methodology for automatically 
incorporating the effects of TFBSs at distal enhancers into MARA. 
However, enhancers are highly cell-type specific; and in many 
cases, the data that users upload to ISMARA may come from sys- 
tems for which no accurate mappings of distal enhancers are 
available. Therefore, automated incorporation of the effects of 
distal enhancers into ISMARA will only be possible when general 
methods for mapping active enhancers in any system have become 
available. Of course, the dynamics of chromatin accessibility and 
enhancer activity are themselves also controlled by constellations 
of regulatory sites on the genome, and our ultimate goal is to de- 
velop computational models that are able to predict genome-wide 
DNA accessibility and enhancer activity in terms of local constel- 
lations of regulatory sites. 

Methods 

In this section we outline the methods that were used for au- 
tomated processing and modeling of the data. More detailed 
descriptions of all procedures are provided in Supplemental 
Methods. 

Promoteromes and regulatory site predictions 

For each model organism of interest (in this work we will focus 
exclusively on data from human and mouse), ISMARA relies on 
two precalculated resources: a genome-wide annotation of pro- 
moters and a comprehensive collection of transcription factor 
binding site (TFBS) predictions in all promoters (Fig. 1A,C). The 
genome-wide annotation of promoters in human and mouse, i.e., 
so-called "promoteromes," were constructed primarily from deep 
sequencing data of transcription start sites (deepCAGE data) 
(de Hoon and Hayashizaki 2008) using Bayesian methods that we 
described previously (Balwierz et al. 2009). To infer expression 
levels of promoters from microarrays of RNA-seq data, it is neces- 
sary to associate all promoters with the transcripts that they drive. 
We thus collected the 5' ends of all known mRNA mappings from 
the UCSC Genome Database, filtered these for mapping quality, 
and clustered all promoters and 5' ends that are within 150 base 
pairs (bps). In this way, we obtained comprehensive sets of pro- 
moters and their associated transcripts for both human (36,383 



promoters) and mouse (34,050 promoters). We also classified the 
promoters into CpG-rich and CpG-poor promoters based on their 
CG and CpG content. 

We next comprehensively predicted TFBSs in the proximal 
promoter regions of all promoters. Briefly, we curated a collection 
of 190 WMs representing —350 mammalian TFs using data from the 
JASPAR (Wasserman and Sandelin 2004) and TRANSFAC (Matys 
et al. 2003) databases, additional motifs from the literature, and our 
own analysis of ChlP-chip and ChlP-seq data. For each promoter, we 
extracted 500 bps upstream of and downstream from the TSS and 
orthologous segments in six other mammals. The seven orthologous 
sequences were then multiply aligned using T-Coffee (Notredame 
et al. 2000). Using the 190 regulatory motifs and a phylogenetic tree 
of the species (Supplemental Fig. 1) as input, we then applied our 
MotEvo algorithm (Arnold et al. 2012a) to predict functional TFBSs 
for all TF regulatory motifs across all promoters in human and 
mouse (Fig. 1A,C). MotEvo is a Bayesian algorithm that considers all 
possible ways in which configurations of binding sites for all 
motifs, as well as additional conserved elements of unknown 
function, can be assigned to the input alignments, calculating likeli- 
hoods for all configurations using a rigorous model of the evolution of 
TFBSs and neutral sequence across the phylogeny. Since different 
motifs show different positioning preferences and abundances relative 
to TSS, which differ between CpG-rich and CpG-poor promoters, we 
also incorporated position-dependent prior probabilities for all motifs, 
separately for CpG and non-CpG promoters. We summarize the 
TFBS predictions in a matrix N, where Npm is the sum of the pos- 
terior probabilities of all predicted TFBSs for motif m in promoter p. 

When modeling expression levels in terms of regulatory sites 
using a linear model, it is relatively straightforward to extend the 
modeling to not only include effects of TFBSs but also the effects 
of miRNA regulation, e.g., as recently introduced in a supervised 
learning scheme for modeling regulation in glioblastomas (Setty 
et al. 2012). In ISMARA, the effects of miRNA regulation have been 
incorporated into a completely automated procedure that can be 
applied to any expression data set. Specifically, we used miRNA 
target site predictions from TargetScan using preferential conser- 
vation scoring (Per) (Friedman et al. 2009), which assigns target 
scores for 86 miRNA seed families to all RefSeq transcripts. To 
associate a target score Npm for miRNA seed family m targeting 
promoter p, we average TargetScan's scores over all transcripts as- 
sociated with promoter p. 

Processing of raw microarray, ChlP-seq, and RNA-seq data 

To perform ISMARA analysis, the user only needs to upload raw 
microarray (i.e., GEL files), RNA-seq, or ChlP-seq (BED or BAM 
files) data. The latter should contain the genomic mappings of the 
raw sequencing reads. The first part of ISMARA's analysis consists 
of processing these raw data into a matrix E, where Eps denotes the 
'signal' associated with promoter p for sample 5. 

When gene expression data is provided in the form of 
microarrays, ISMARA first automatically detects the particular type 
of microarray used and then applies corrections for background 
and unspecific binding tailored to that microarray type. Micro- 
array platforms currently supported by ISMARA are listed in Sup- 
plemental Table 1. Using Gaussian mixture modeling, probes are 
classified into 'expressed' and 'non-expressed' for each sample. 
Probes that are consistently nonexpressed are removed and the 
intensities of the remaining probes are quantile normalized. In- 
stead of relying on annotation of the manufacturer, we map all 
probe sequences to all transcripts associated with our promoters. 
The final log-expression of a given promoter is given by a weighted 
average of the log-intensities of all probes mapping to the tran- 
scripts associated with the promoter. 
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In many applications of next-generation sequencing data, 
a main aim of the analysis is to detect genomic regions that are 
significantly enriched, or transcripts that are significantly differ- 
entially expressed, so that the analysis crucially depends on the 
noise statistics of sequencing data (Balwierz et al. 2009; Anders and 
Huber 2010). In contrast, ISMARA aims to model the variation in 
'signal' Eps, i.e., the amount of chromatin immunoprecipitation or 
the amount of expression, across promoters p and samples s in 
terms of predicted TFBSs. Our aim is thus not to assess the statis- 
tical significance of changes in the signal, but to estimate the rel- 
ative strength of the signal across promoters and conditions. 
When processing ChlP-seq data, the signal Eps is calculated as the 
estimated logarithm of the fraction of reads in sample s that map to 
a 2-kilobase region centered on promoter p. To avoid large fluctu- 
ations in Eps at promoters with low signal due to sequencing noise, 
this estimate involves using a uniform prior distribution across the 
genome. 

When processing RNA-seq data, the mapped reads are first 
mapped to our transcript set in a weighted manner. That is, 
when a read maps to n separate transcripts, each transcript's read 
count is incremented by 1/n. The expression of each transcript is 
then estimated by dividing its read count by transcript length, 
and the expression of a promoter is calculated by summing the 
expression of the transcripts associated with it. The final level Eps 
is the logarithm of the estimated number of transcripts per 
million transcripts in the cells of sample s that derived from 
promoter p. 



inference of motif activities 

At the core of ISMARA is the MARA model (The FANTOM 
Consortium et al. 2009) which, similar to previous linear mod- 
eling approaches (Gao et al. 2004; Nguyen and D'haeseleer 2006), 
assumes that the 'signal' at each promoter pisd linear function of 
its binding sites Np^: 



Eps =Cs + Cp + Yu^pm^ms + noiSC, 



(1) 



where Cp is a term reflecting the average activity of promoter p 
across the samples; q reflects the total expression in sample s; and the 



maximum likelihood estimates. 

As a result. Equation (1) is renormalized into 



45 = S^pm^m5 + noise, 



(2) 



where the matrix E' is obtained by subtracting the row and column 
averages from the entries of E. Similarly, N' is obtained by sub- 
tracting the column averages, i.e., the average number of sites <N^> 
for each motif m, from the entries of N. Finally the activities A' are 
obtained by subtracting the average motif activities across the 
samples from the activities Ams- That is, in Equation (2) the ex- 
pression changes across the samples, and promoters are modeled in 
terms of changes in site counts across promoters and changes in 
motif activities across the samples. 

As explained in the Supplemental Methods, the noise term in 
the above equation is dominated not by measurement or bi- 
ological replicate noise, but by the error in the model, and we 
assume these errors are Gaussian distributed with an unknown 
variance, o^, that is integrated out of the likelihood. To infer 
the activities, ISMARA uses a Bayesian procedure that combines 



the Gaussian likelihood model for the difference between the 
measured signal E^^ and the predicted signal with a Gaussian prior 
distribution for the activities. This prior distribution, which fa- 
vors small activities, is used to avoid overfitting. Its parameter is 
estimated automatically using 80/20 cross-validation: The activ- 
ities are inferred on a randomly chosen 'training set' of 80% of 
the promoters, and the prior's parameter is set so as to maximize 
the fit of the predicted expression profiles on the 'test set' con- 
sisting of the remaining 20% of the promoters. In this way, 
ISMARA automatically adapts its prior to each data set that is 
submitted. 

The final posterior distribution of motif activities is a multi- 
variate Gaussian, which is determined using singular value de- 
composition (see Supplemental Methods). By projecting the mul- 
tivariate Gaussian onto individual motifs, ISMARA also calculates 
standard deviations 8A^^ on all motif activities. Finally, the overall 
significance of each motif m in explaining variations in Ep^ is 
summarized by a z-like statistic. 



1 ^ 

- y 



(3) 



where 5 is the number of samples. The Z-scores calculate how 
many standard deviations away from zero on average the inferred 
motif activities are. 

Popular alternatives to a Gaussian prior include Laplacian 
priors, also referred to as Lasso regularization (Tibshirani 1996), or 
a product of Gaussian and Laplacian priors, also referred to as 
elastic net regularization (Zou and Hastie 2005). These priors are 
often considered attractive because they induce sparsity, i.e., a 
subset of the fitted parameters will be set strictly to zero. However, 
since ISMARA by default sorts motifs by their significance z^, 
motifs with weak activities move to the bottom of the list, where 
they will be ignored by most users. Moreover, in some cases a user 
might be interested in the inferred activity of a particular motif, 
even if its significance is weak, and the Gaussian prior ensures that 
a nonzero motif activity profile is inferred for every motif. 

Although users will typically be primarily interested in motif 
activity changes that explain expression changes across the con- 
ditions, in some situations it would also be interesting to fit the 
average expression {E^ of each promoter, i.e., averaged across all 
samples, in terms of average motif activities A^n- ISMARA fits such 
average activities using the same procedure, using a separate prior 
for the average motif activities A^, and fitting this prior separately 
using cross-validation. 



Target predictions 

ISMARA also predicts which individual promoters are regulated by 
each motif m. As detailed in the Supplemental Methods, for each 
promoter with predicted TFBSs for the motif (i.e., Npm > 0), ISMARA 
estimates the log-likelihood ratio Spm of the entire model with the 
TFBSs for m in p present and the model in which the entry Npm has 
been set to zero. That is, we in silico mutate the promoter p such 
that its TFBSs for motif m are removed and then recalculate the 
probability of the data E with this mutated site-count matrix, in- 
tegrating over all unknown activities. Thus, Spm rigorously quan- 
tifies how much removal of the sites for m in p decreases the fit of 
the model to the data. 

Finally, enrichment of targets within particular Gene Ontol- 
ogy categories is done by selecting all targets where inclusion of 
motif m substantially helps predicting the expression levels {Spm > 1) 
and performing a standard hypergeometric test. Target networks 
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between motifs are constructed by drawing a link from motif m to 
m' whenever m is predicted to target one of the promoters asso- 
ciated with a TF that is associated with motif m' . 



Materials 

The publically available data sets of gene expression profiling 
were obtained from Gene Expression Omnibus (http://www.ncbi. 
nlm.nih.gov/geo): time course of HUVEC after TNF treatment 
(GSE9055), mucociliary differentiation of airway epithelial cells 
(GSE5264), Novartis (GNF) SymAtlas (GSE1133), epithelial and 
mesenchymal subpopulations within immortalized human mam- 
mary epithelial cells (GSE28681), ENCODE ChlP-seq (GSE26386) 
and expression profiling (GSE26312) in human cell lines, and the 
Illumina Body Map 2 (GSE30611). Microarray files from the NCI- 
60 samples were downloaded from the project web page (http:// 
genome-www.stanford.edu/nci60/). 
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